Chapter 4. Clustering and classification

I completed all the DataCamp exercises prior to proceeding to this exercise. I also looked up some details on how to use R Markdown more productively, so hopefully this report will be clearer than the previous ones.

Today we are looking at the Boston data set and the crime rate within it.

Boston Data set information:

| Variable | Description |
|----------|-------------|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
require("easypackages") # for loading and installing many packages at one time
packages_to_load <- c("broom", "dplyr", "MASS", "tidyverse", "corrplot", "ggplot2", "GGally", "caret", "devtools", "ggthemes", "scales", "plotly")
packages(packages_to_load, prompt = TRUE) # lock'n'load install/load
# Additionally
# install_github("fawda123/ggord") # Installed from GitHub for visualization
library(ggord)
# To empty the memory after the previous exercise
# rm(list=ls())
# Load data
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # to set working directory to source file

# load the data
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

All variables are numeric; chas and rad are integers while the rest are continuous. The data contains demographic information about Boston, including tax rates and other variables possibly linked to crime rates within the city.

my_fn <- function(data, mapping, ...){
  # scatterplot with both a loess smoother (red) and a linear fit (blue)
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method = "loess", fill = "red", color = "red", ...) +
    geom_smooth(method = "lm", fill = "blue", color = "blue", ...)
  p
}
g <- ggpairs(Boston, columns = 1:14, lower = list(continuous = my_fn))
g

  1. It seems that indus, nox, rad, tax and lstat have a positive correlation with crime, while dis, black and medv have a negative correlation with crime.
  2. Many of the variables seem to have non-linear relationships (red loess lines in the figure); however, the variable pairs with high correlation appear more linear.
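The first observation can be checked numerically. A minimal sketch (assuming only that MASS provides the Boston data, as above):

```r
library(MASS)

# correlation of every variable with crim, sorted from most negative
# to most positive
crim_cor <- sort(cor(Boston)[, "crim"])
round(crim_cor, 2)
```

rad and tax should appear at the positive end and dis, black and medv at the negative end, matching the pairs plot.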
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The correlation plot gives a more focused view of the correlations, since they are hardly visible in the pairs plot with this many variables.

# Scaling the variables with mean and standard deviation with scale()
Boston_scaled <- as.data.frame(scale(Boston))
# create a quantile vector of crime and print it
bins <- quantile(Boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(Boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)

# add the new categorical value to scaled data
Boston_scaled <- data.frame(Boston_scaled, crime)

# number of rows in the Boston dataset
n <-nrow(Boston)

# choose randomly 80% of the rows (no seed is set, so the split varies between runs)
ind <- sample(n, size = n * 0.8)

# create train set
train <- Boston_scaled[ind,]

# create test set
test <- Boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <-test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

summary(train)
##        zn               indus               chas         
##  Min.   :-0.48724   Min.   :-1.51549   Min.   :-0.27233  
##  1st Qu.:-0.48724   1st Qu.:-0.86683   1st Qu.:-0.27233  
##  Median :-0.48724   Median :-0.19558   Median :-0.27233  
##  Mean   :-0.01734   Mean   : 0.02999   Mean   :-0.03844  
##  3rd Qu.: 0.04872   3rd Qu.: 1.01499   3rd Qu.:-0.27233  
##  Max.   : 3.58609   Max.   : 2.42017   Max.   : 3.66477  
##       nox                 rm                age          
##  Min.   :-1.46443   Min.   :-3.87641   Min.   :-2.33313  
##  1st Qu.:-0.87761   1st Qu.:-0.57910   1st Qu.:-0.79577  
##  Median :-0.14407   Median :-0.11334   Median : 0.33305  
##  Mean   : 0.01955   Mean   :-0.02494   Mean   : 0.01982  
##  3rd Qu.: 0.61319   3rd Qu.: 0.45525   3rd Qu.: 0.90413  
##  Max.   : 2.72965   Max.   : 3.55153   Max.   : 1.11639  
##       dis                rad                tax          
##  Min.   :-1.26582   Min.   :-0.98187   Min.   :-1.31269  
##  1st Qu.:-0.80637   1st Qu.:-0.63733   1st Qu.:-0.75495  
##  Median :-0.32661   Median :-0.52248   Median :-0.39895  
##  Mean   :-0.03654   Mean   : 0.03555   Mean   : 0.04215  
##  3rd Qu.: 0.51946   3rd Qu.: 1.65960   3rd Qu.: 1.52941  
##  Max.   : 3.95660   Max.   : 1.65960   Max.   : 1.79642  
##     ptratio             black              lstat        
##  Min.   :-2.70470   Min.   :-3.90333   Min.   :-1.5296  
##  1st Qu.:-0.48756   1st Qu.: 0.19715   1st Qu.:-0.7640  
##  Median : 0.29768   Median : 0.38442   Median :-0.1664  
##  Mean   : 0.02534   Mean   :-0.01071   Mean   : 0.0273  
##  3rd Qu.: 0.80578   3rd Qu.: 0.43330   3rd Qu.: 0.6266  
##  Max.   : 1.63721   Max.   : 0.44062   Max.   : 3.5453  
##       medv               crime    
##  Min.   :-1.90634   low     : 95  
##  1st Qu.:-0.63692   med_low :103  
##  Median :-0.15579   med_high:100  
##  Mean   :-0.03298   high    :106  
##  3rd Qu.: 0.25195                 
##  Max.   : 2.98650

Now we have generated the train and test datasets with the new categorical variable “crime”, which is based on the quantiles of the original numeric variable.
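As a sanity check, the quantile-based cut should split the 506 observations into four roughly equal classes. A sketch repeating the binning steps above:

```r
library(MASS)

# scale and bin crim exactly as above
Boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(Boston_scaled$crim)
crime <- cut(Boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# each class should hold about a quarter of the 506 rows
table(crime)
```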

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2351485 0.2549505 0.2475248 0.2623762 
## 
## Group means:
##                   zn      indus       chas        nox         rm
## low       0.94440483 -0.8860431 -0.1065564 -0.8565321  0.3760637
## med_low  -0.06533749 -0.3106752 -0.0812077 -0.5700430 -0.1289125
## med_high -0.38347752  0.2070187  0.1607520  0.3788041  0.1350744
## high     -0.48724019  1.0149946 -0.1237592  1.0386948 -0.4342557
##                 age        dis        rad        tax     ptratio
## low      -0.8735571  0.8276950 -0.6953589 -0.7285311 -0.36259909
## med_low  -0.3256663  0.3411869 -0.5458997 -0.4568972 -0.07184181
## med_high  0.4146562 -0.3977027 -0.3927076 -0.2881689 -0.33328026
## high      0.7836908 -0.8373930  1.6596029  1.5294129  0.80577843
##                black       lstat         medv
## low       0.38675456 -0.73929892  0.444285615
## med_low   0.34981973 -0.11922776  0.000127677
## med_high  0.08575804 -0.03644602  0.227810152
## high     -0.80826193  0.91685446 -0.738929755
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.138009125  0.74028583 -1.09696055
## indus   -0.003847934 -0.32548096  0.08754348
## chas    -0.067664075 -0.06180598  0.04520960
## nox      0.285937634 -0.82885041 -1.24525429
## rm      -0.086150700 -0.05590585 -0.09501534
## age      0.316543734 -0.32356275 -0.11492945
## dis     -0.140728653 -0.36444382  0.29952893
## rad      3.255014878  0.80632853 -0.25830319
## tax     -0.067812124  0.08209102  0.81565185
## ptratio  0.118808332  0.09934037 -0.28106515
## black   -0.178445563  0.01398437  0.14550745
## lstat    0.171614667 -0.19522951  0.48709027
## medv     0.141566523 -0.47310859 -0.09784005
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9494 0.0384 0.0122
g <- ggord(lda.fit, train$crime, ellipse_pro = 0.95, vec_ext = 2, alpha_el = 0.3, size = 2)
g  + theme_tufte()  + geom_rangeframe()

  1. The points are separated mainly along the first linear discriminant, with the rad arrow dominating in the high group.
  2. From the LDA summary, we can see that the first linear discriminant explains about 94.9 % of the between-group variance in the Boston dataset.
  3. Also, rad (= index of accessibility to radial highways) has a relatively high coefficient in the first discriminant, so it proves to be useful in the classification. rad had a positive correlation with crime.
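The proportion-of-trace values reported by lda() can be recomputed from the singular values stored in the fit. A self-contained sketch on the built-in iris data (not the Boston split above, whose exact values depend on the random split):

```r
library(MASS)

# fit an LDA and recover the proportion of trace manually
fit <- lda(Species ~ ., data = iris)
prop <- fit$svd^2 / sum(fit$svd^2)
round(prop, 4)  # the first discriminant dominates, as with the Boston fit
```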
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17      13        2    0
##   med_low    1      17        5    0
##   med_high   1       9       16    0
##   high       0       0        1   20

In the high-crime quartile we predicted nearly all cases correctly (20 of 21), but the other classes did not fare as well. Still, LDA seems able to predict our classes with reasonable overall accuracy.
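The overall accuracy can be read off the cross tabulation as the diagonal over the total. Using the counts shown above (these vary between runs, since the train/test split is random):

```r
# confusion matrix copied from the cross tabulation above
# (rows = correct, columns = predicted: low, med_low, med_high, high)
tab <- matrix(c(17, 13,  2,  0,
                 1, 17,  5,  0,
                 1,  9, 16,  0,
                 0,  0,  1, 20),
              nrow = 4, byrow = TRUE)

# overall accuracy = correctly classified / all test cases
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 2)  # about 0.69 for this particular split
```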

# k-means clustering
BostonS <- scale(Boston) # standardize variables

# total within-cluster sum of squares for k = 1 (the total sum of squares)...
wss <- (nrow(BostonS) - 1) * sum(apply(BostonS, 2, var))
# ...and for k = 2..15
for (i in 2:15) wss[i] <- sum(kmeans(BostonS, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

km <-kmeans(BostonS, centers = 4)

# plot the Boston dataset with clusters
pairs(BostonS, col = km$cluster)

The cluster structure in the pairs plot looks similar to the grouping we saw with LDA.
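Because kmeans() starts from random centers, the elbow plot above changes between runs; a fixed seed and several restarts make it stable. A sketch (the seed value is an arbitrary choice):

```r
library(MASS)

BostonS <- scale(Boston)
set.seed(42)  # assumption: any fixed seed works

# total within-cluster sum of squares for k = 1..6, with 10 restarts each
wss <- sapply(1:6, function(k) kmeans(BostonS, centers = k, nstart = 10)$tot.withinss)
wss  # decreases as k grows; look for the "elbow" where the drop levels off
```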

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

Now we can see a 3D picture of the training observations projected onto the three linear discriminants! (Passing color = train$crime to plot_ly would additionally color the points by class.)
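The matrix multiplication above reproduces what predict() does internally for the discriminant scores. A self-contained sketch on iris (not the Boston split, so it can be run on its own; iris has equal group sizes, so centering by the overall predictor means matches predict()'s centering by the prior-weighted group means):

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])

# project manually: center the predictors, then multiply by the scaling matrix
proj <- scale(X, center = colMeans(X), scale = FALSE) %*% fit$scaling

# predict() computes the same discriminant scores
same <- max(abs(proj - predict(fit)$x)) < 1e-8
same
```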